
BLAS vector-vector dot #294

Merged
abergeron merged 16 commits into Theano:master from khaotik:blas_dot on Nov 29, 2016

Conversation

@khaotik (Contributor) commented Nov 24, 2016

For #292.

Comment thread src/gpuarray/blas.h Outdated
#endif

// only for vector-vector dot
GPUARRAY_PUBLIC int GpuArray_dot( GpuArray *A, GpuArray *B,
Member:

You declare _dot and then use _rdot. That might not work.

ctx->err = cuModuleLoadData(&res->m, bin);
// for both info/err log
cujit_info_log = (char*)malloc(2*cujit_log_size*sizeof(char));
if(cujit_info_log == NULL) {
Member:

Space after if.


ctx->err = cuModuleLoadData(&res->m, bin);
// for both info/err log
cujit_info_log = (char*)malloc(2*cujit_log_size*sizeof(char));
Member:

No need to cast the return of malloc.

Contributor Author:

I think it's more friendly to possible inclusion from C++ code, but I'll remove it anyway.

Member:

The headers should be C++-clean, but the source files will probably never be touched by a C++ compiler.
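
For illustration, a minimal standalone sketch of the point being made: in C, the `void *` returned by `malloc` converts implicitly to any object pointer type, so the cast is redundant, while C++ would require it (the log-size value here is illustrative).

```c
#include <stdlib.h>

int main(void) {
  size_t cujit_log_size = 1024; /* illustrative size */
  /* In C, malloc's void * return converts implicitly to char *,
   * so the (char *) cast is unnecessary (and sizeof(char) is 1). */
  char *cujit_info_log = malloc(2 * cujit_log_size * sizeof(char));
  if (cujit_info_log == NULL)
    return 1;
  free(cujit_info_log);
  return 0;
}
```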

@abergeron

This PR seems to be more about using cuModuleLoadDataEx than blas dot.

The blas parts should either be completed or removed.

@abergeron

OK, sorry. I just realized that you reused the same branch for this PR, so the commits from the other one are in here. You should probably rebase on the current master.

Comment thread src/gpuarray_array_blas.c Outdated
if (!(X->flags & GA_ALIGNED) || !(Y->flags & GA_ALIGNED) ||
!(Z->flags & GA_ALIGNED))
return GA_UNALIGNED_ERROR;
if (X->dimensions[0] != n || Y->dimensions[0] != n)
Member:

You don't set n before this. Also you could just compare them to each other.
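A minimal sketch of the suggested fix (hypothetical; assumes `GA_VALUE_ERROR` is the appropriate error code for a length mismatch):

```c
/* Compare the two vector lengths to each other instead of against
 * an unset n, then derive n from them. */
if (X->dimensions[0] != Y->dimensions[0])
  return GA_VALUE_ERROR;
n = X->dimensions[0];
```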

Comment thread src/gpuarray_array_blas.c Outdated
Yp = &copyY;
}
}
if (Z->strides[0] < 0) {
Member:

Z doesn't have strides, it's a scalar.

Comment thread src/gpuarray_blas_cuda_cublas.c Outdated

// we should store dot result on device
cublasGetPointerMode(h->h, &pmode);
cublasSetPointerMode(h->h, CUBLAS_POINTER_MODE_HOST);
Member:

I think you meant to set the mode to DEVICE here.
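A sketch of the suggested fix using the cuBLAS pointer-mode API: DEVICE mode makes the dot routine write its result through a device pointer, which is what storing the result on the device requires (the actual dot call is elided):

```c
cublasPointerMode_t pmode;
cublasGetPointerMode(h->h, &pmode);
/* DEVICE mode: the result pointer passed to cublasSdot is a device
 * pointer, so the result stays on the GPU. */
cublasSetPointerMode(h->h, CUBLAS_POINTER_MODE_DEVICE);
/* ... cublasSdot(...) ... */
cublasSetPointerMode(h->h, pmode); /* restore the previous mode */
```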

Comment thread src/gpuarray_blas_cuda_cublas.c Outdated
size_t N,
gpudata *X, size_t offX, size_t incX,
gpudata *Y, size_t offY, size_t incY,
gpudata *Z) {
Member:

There should be an offset for Z.
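A hypothetical corrected prototype, adding an `offZ` parameter to mirror the offsets taken for X and Y:

```c
static int sdot(size_t N,
                gpudata *X, size_t offX, size_t incX,
                gpudata *Y, size_t offY, size_t incY,
                gpudata *Z, size_t offZ);
```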

Comment thread src/gpuarray_blas_cuda_cublas.c Outdated

GA_CUDA_EXIT_ON_ERROR(ctx, cuda_wait(X, CUDA_WAIT_READ));
GA_CUDA_EXIT_ON_ERROR(ctx, cuda_wait(Y, CUDA_WAIT_READ));
GA_CUDA_EXIT_ON_ERROR(ctx, cuda_wait(Z, CUDA_WAIT_ALL));
Member:

We should only wait for write on Z since it is not read by dot.
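The suggested change, assuming libgpuarray's `CUDA_WAIT_WRITE` flag:

```c
/* Z is only written by dot, never read, so a write wait suffices. */
GA_CUDA_EXIT_ON_ERROR(ctx, cuda_wait(Z, CUDA_WAIT_WRITE));
```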

@abergeron

Also, please get rid of the merge commit and do a proper rebase.

@abergeron

Just to note that you still have a merge commit. Just a different one now.

After a rebase, when you want to push to GitHub, git will tell you to run `git pull`. This is wrong; you should run `git push -f` instead.

Otherwise, the code progresses nicely. Good job!

@abergeron abergeron dismissed their stale review November 25, 2016 16:57

Lots of changes

Plus some minor changes:

- did `chmod +x setup.py`
- added interface for clblas
- added tests for BLAS dot
- implementation for CLBlast
- modified blas tests from using nested for loops to itertools.product for parametrized tests
@khaotik khaotik changed the title from "[WIP] BLAS vector-vector dot" to "BLAS vector-vector dot" on Nov 27, 2016
@khaotik (Contributor Author) commented Nov 27, 2016

@abergeron

Mostly done, just a few problems.

Changes

  • BLAS dot for libgpuarray, with CUDA/clBLAS/CLBlast backends.
    clblasSdot needs a working buffer of size N, probably for the sum reduction. I just made a naive implementation (allocate, then release; see the sketch after this list). I was about to make a static buffer shared between calls, but that is not thread safe, so for now there is some overhead with the naive implementation.
  • pygpu bindings and tests.
  • Changed all strides (inc* arguments) to type int to allow negative strides.
    I did this because it is specified in the BLAS standard and is supported by some CPU BLAS libraries. I am not sure whether these GPU libraries will change in the future.
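
A minimal sketch of the naive allocate-then-release approach described above, assuming the standard clblasSdot signature; the `buf` and `q` member names are illustrative and error handling is trimmed:

```c
cl_int cl_err;
/* clblasSdot requires a scratch buffer of at least N elements,
 * presumably to hold the products before the sum reduction. */
cl_mem scratch = clCreateBuffer(ctx->ctx, CL_MEM_READ_WRITE,
                                N * sizeof(float), NULL, &cl_err);
if (cl_err != CL_SUCCESS)
  return GA_MEMORY_ERROR;
cl_err = clblasSdot(N, Z->buf, offZ, X->buf, offX, incX,
                    Y->buf, offY, incY, scratch,
                    1, &ctx->q, 0, NULL, NULL);
/* Naive: release right away. A buffer cached across calls would
 * avoid the per-call allocation overhead, but a plain static
 * buffer is not thread safe. */
clReleaseMemObject(scratch);
```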

Comment thread src/gpuarray_blas_opencl_clblas.c Outdated
scratch_mem = clCreateBuffer(
ctx->ctx, CL_MEM_READ_WRITE, N*sizeof(float), NULL, &cl_err);
if (cl_err != CL_SUCCESS)
return GA_MEMORY_ERROR;
Member:

The temporary buffer allocation is ok, however please use cl_alloc() to allocate it so we can have one central location for all allocations.

Comment thread src/gpuarray_blas_opencl_clblast.c Outdated
error,
hdot, /* TODO */
sdot, /* TODO */
ddot, /* TODO */
Member:

These are not TODO.

Comment thread src/private.h Outdated

int (*hdot)( size_t N,
gpudata *X, size_t offX, int incX,
gpudata *Y, size_t offY, int incY,
Member:

Don't use int, use ssize_t for a signed type. We don't want to limit the size of arrays at this level.
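A sketch of the signed-stride variant the reviewer asks for, mirroring the `hdot` slot above (hypothetical):

```c
int (*hdot)(size_t N,
            gpudata *X, size_t offX, ssize_t incX,
            gpudata *Y, size_t offY, ssize_t incY,
            gpudata *Z, size_t offZ);
```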

Contributor Author (@khaotik) commented Nov 29, 2016:

It seems some BLAS functions use int as the stride type (such as *gemv and *ger) while others use size_t. For now I just reverted to size_t. I am experimenting with ssize_t but am getting test errors, so that still needs some work. I think this is ready for the jenkins test if there are no other issues. I will leave the stride problem to another PR.

Comment thread src/gpuarray_blas_cuda_cublas.c Outdated
" const float *x[], size_t incx, " \
" float *y[], size_t incy, " \
" const float *x[], int incx, " \
" float *y[], int incy, " \
Member:

Here you changed the data type in the kernel without changing the declared data type. This will break badly.

@abergeron

jenkins test this please
